Moving from serial CPU programming to GPU programming requires a paradigm shift: from element-wise iteration to block-based execution. Instead of treating data as a stream of scalars, we treat it as a collection of "blocks" that can be scheduled to saturate the hardware's bandwidth.
1. Memory-Bound vs. Compute-Bound
A kernel's bottleneck is determined by the ratio of math operations to memory accesses. Vector addition is typically memory-bound because it performs only one addition for every three memory operations (two loads, one store). The hardware spends far more time waiting on DRAM than actually computing.
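The ratio above can be made concrete as an arithmetic-intensity calculation. This is a back-of-the-envelope sketch, assuming float32 elements and no caching; the numbers are illustrative, not measured:

```python
# Arithmetic intensity of vector addition (out = x + y).
# Assumption: float32 (4 bytes/element), every access goes to DRAM.

def arithmetic_intensity(flops: int, bytes_moved: int) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

N = 1_000_000
flops = N                  # one add per element
bytes_moved = 3 * 4 * N    # two loads + one store, 4 bytes each

ai = arithmetic_intensity(flops, bytes_moved)
print(f"Arithmetic intensity: {ai:.4f} FLOP/byte")  # ~0.083 FLOP/byte
```

At roughly 0.08 FLOP/byte, vector addition sits far below the compute-to-bandwidth ratio of any modern GPU, which is why the memory bus, not the ALUs, sets its speed limit.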
2. The Role of BLOCK_SIZE
BLOCK_SIZE defines the granularity of parallelism. If it is too small, we cannot fill the GPU's wide execution lanes. A well-chosen size keeps enough work "in flight" to saturate the memory bus.
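How BLOCK_SIZE carves a 1-D problem into blocks can be sketched on the CPU. The function names below are illustrative (they mirror the Triton-style convention of a program id per block with out-of-bounds masking), not a real launch API:

```python
import math

def launch_grid(n_elements: int, block_size: int) -> int:
    """Number of blocks needed to cover n_elements (ceiling division)."""
    return math.ceil(n_elements / block_size)

def block_indices(pid: int, block_size: int, n_elements: int) -> list[int]:
    """Element indices handled by block `pid`, masked to stay in bounds."""
    start = pid * block_size
    return [i for i in range(start, start + block_size) if i < n_elements]

n = 1000
print(launch_grid(n, 128))        # 8 blocks
print(block_indices(7, 128, n))   # last block covers 896..999 (partial)
```

The mask in `block_indices` is the CPU analogue of the boundary check a GPU kernel needs whenever `n_elements` is not a multiple of BLOCK_SIZE.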
3. Hiding Latency Through Occupancy
Occupancy is the number of active blocks on the GPU. While not a goal in itself, it lets the scheduler switch to another block and keep computing while one block waits on a high-latency fetch from device memory.
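The latency-hiding effect can be captured with a toy analytical model (an assumption-laden sketch, not a real GPU simulator): each block waits `latency` cycles for memory, then computes for `compute` cycles, and the scheduler overlaps one block's wait with the others' compute:

```python
def time_per_block(latency: int, compute: int, resident_blocks: int) -> float:
    """Average cycles per block; the stall shrinks as more blocks are in flight."""
    # Other resident blocks' compute time can hide this block's memory wait.
    hidden = min(latency, (resident_blocks - 1) * compute)
    return latency - hidden + compute

L, C = 400, 20
print(time_per_block(L, C, 1))    # 420: no overlap, full stall
print(time_per_block(L, C, 21))   # 20: latency fully hidden
```

With enough resident blocks (here 21), the 400-cycle memory stall vanishes from the critical path; this is why occupancy matters even though no individual block runs any faster.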
4. Hardware Utilization
To maximize performance, we must align our BLOCK_SIZE with the GPU architecture's memory-coalescing rules, ensuring that consecutive threads access consecutive memory addresses.
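Coalescing can be illustrated by counting memory transactions. The 32-element transaction granularity below is an assumption (chosen to mirror a 128-byte line of float32 values); the point is the contrast between contiguous and strided access:

```python
TRANSACTION = 32  # elements per memory transaction (assumed granularity)

def transactions_needed(addresses: list[int]) -> int:
    """Distinct aligned transactions touched by one warp's accesses."""
    return len({addr // TRANSACTION for addr in addresses})

warp = range(32)
coalesced = [tid for tid in warp]       # thread i -> element i (contiguous)
strided = [tid * 32 for tid in warp]    # thread i -> element 32*i (scattered)

print(transactions_needed(coalesced))   # 1 transaction
print(transactions_needed(strided))     # 32 transactions
```

The same 32 loads cost either one transaction or thirty-two depending purely on the access pattern, a 32x difference in effective bandwidth under this model.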
QUESTION 1
For a kernel that adds two vectors ($out = x + y$), what is the most likely bottleneck on modern GPUs?
Arithmetic Throughput
Memory Bandwidth
Register Pressure
Shared Memory Latency
✅ Correct!
Vector addition involves very little math compared to the amount of data moved (3 memory ops per 1 add), making it memory-bound.
❌ Incorrect
Arithmetic throughput is rarely the bottleneck for simple element-wise operations like addition.
QUESTION 2
What is the primary purpose of 'Occupancy' in the GPU execution model?
To ensure every thread runs as fast as possible.
To hide memory latency by keeping work in flight.
To increase the clock speed of the compute units.
To reduce the power consumption of the HBM.
✅ Correct!
High occupancy allows the GPU to switch to other active threads while some wait for data from global memory.
❌ Incorrect
Occupancy doesn't change thread speed or clock frequency; it focuses on scheduler efficiency.
QUESTION 3
Which of the following describes 'Memory-Bound' behavior?
The GPU is waiting for the memory bus to deliver data.
The GPU has exhausted its available VRAM.
The kernel is performing too many complex floating-point operations.
The CPU cannot launch kernels fast enough.
✅ Correct!
Memory-bound kernels are limited by the speed of data transfer from DRAM/HBM to the registers.
❌ Incorrect
Exhausting VRAM is an Out-of-Memory error, not a 'memory-bound' performance bottleneck.
QUESTION 4
What happens if the BLOCK_SIZE is set too small?
The kernel will fail with a memory error.
The GPU fails to utilize its wide SIMD execution lanes.
The memory bandwidth increases significantly.
Register pressure becomes too high.
✅ Correct!
Small block sizes result in underutilization because the hardware's execution units expect many threads to work in parallel.
❌ Incorrect
Small block sizes actually reduce register pressure but hurt throughput.
QUESTION 5
In the logistics warehouse analogy, what represents the 'Blocks'?
The individual items.
The workers.
The organized pallets.
The delivery trucks.
✅ Correct!
Organizing items into pallets (Blocks) ensures efficient transport and processing by workers (Compute Units).
❌ Incorrect
The trucks represent the memory bus; the workers represent the compute units.
Case Study: Bottleneck Analysis
Identifying Kernel Constraints
You are profiling four kernels: a large Vector Addition kernel, a Deep Matrix Multiplication (GEMM) kernel, a tiny 4-element Vector Addition kernel, and a kernel that performs ReLU on a matrix. You need to categorize their bottlenecks based on hardware utilization theory.
Q
1. For each kernel (Vector Add, Matrix Multiply, 4-element Vector Add), decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead.
Solution:
1. **Vector Addition**: Memory Bandwidth (low math-to-memory ratio).
2. **Deep Matrix Multiply**: Arithmetic Throughput (high $O(N^3)$ compute vs. $O(N^2)$ memory).
3. **4-element Vector Add**: Launch Overhead (the time to start the GPU kernel outweighs the tiny workload).
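The vector-add vs. GEMM distinction follows directly from their math-to-memory ratios. A rough comparison, assuming float32 and naive (uncached) traffic counts, with an illustrative matrix size:

```python
def flops_per_byte(flops: int, bytes_moved: int) -> float:
    """Arithmetic intensity: FLOPs per byte of memory traffic."""
    return flops / bytes_moved

N = 4096  # illustrative problem size

# Vector add: N adds, 3N float32 transfers (two loads + one store).
vec_add = flops_per_byte(N, 3 * 4 * N)

# GEMM: 2*N^3 FLOPs (multiply + add), three N x N float32 matrices of traffic.
gemm = flops_per_byte(2 * N**3, 3 * 4 * N**2)

print(f"vector add: {vec_add:.3f} FLOP/byte")   # ~0.083 -> memory-bound
print(f"GEMM:       {gemm:.1f} FLOP/byte")      # ~682.7 -> compute-bound
```

GEMM's intensity grows linearly with $N$ because its $O(N^3)$ compute is amortized over $O(N^2)$ data, while vector add's intensity is a small constant regardless of size; the 4-element case is so tiny that neither ratio matters and fixed launch cost dominates.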
Q
2. Determine the bottleneck for a ReLU operation on a large matrix.
Solution:
The bottleneck for **ReLU** on a matrix is **Memory Bandwidth**. Since the operation is a simple comparison ($\max(0, x)$), it is extremely computationally cheap, meaning performance is dictated by how fast the GPU can read the matrix from and write it back to global memory.
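The ReLU conclusion can be checked with a quick traffic estimate. Sizes are illustrative and caching is ignored; the point is that the byte count dwarfs the FLOP count:

```python
# Memory traffic estimate for ReLU over a large float32 matrix (assumed size).
rows, cols = 8192, 8192
bytes_per_elem = 4              # float32
n = rows * cols

read_bytes = n * bytes_per_elem     # load the input matrix
write_bytes = n * bytes_per_elem    # store the result
total_gb = (read_bytes + write_bytes) / 1e9

flops = n                           # one compare/select per element
print(f"traffic: {total_gb:.2f} GB, "
      f"{flops / (read_bytes + write_bytes):.3f} FLOP/byte")
```

At 0.125 FLOP/byte, ReLU sits in the same memory-bound regime as vector addition: the GPU moves over half a gigabyte to perform one trivial operation per element.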